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Abstract 

We investigate the performance gains from hyper-systolic implementations 
of n^-loop problems on the massively parallel computer Quadrics, exploiting 
its 3-dimensional interprocessor connectivity. For illustration we study the 
communication aspects of an exact molecular dynamics simulation of n parti- 
cles with Coulomb (or gravitational) interactions. We compare the interpro- 
cessor communication costs of the standard-systolic and the hyper-systolic 
approaches for various granularities. We predict gain factors as large as 3 
on the Q4 and 8 on the QH4 and measure actual performances on these ma- 
chine configurations. We conclude that it appears feasable to investigate the 
thermodynamics of a full gravitating n-body problem with 0(10000) particles 
using the new method on a QH4 system. 



1 Introduction 



Many grand challenge tasks in computational science involve the solution of 
n^-loop problems, demanding for algorithms and techniques viable on parallel 
machines. One would wish to increase the number of processors proportional 
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to n" such as to achieve a computational complexity of 0{n? ") per proces- 
sor[|. This strategy sets the goal towards the limit of low granularity! 

In this paper we treat, as a prototype of an n^-loop problem^, the molecular 
dynamics of n equal particles with long range Coulomb forces. Computation- 
ally, this implies repeated calculation of all interparticle distances within 
the full system. The state-of-the-art system size in molecular dynamics with 
an exact treatment of such long-range forces is at present of 0(10.000), re- 
quiring machines in the 10 GFlops range. 

On a massively parallel computer with p processors and distributed memory, 
the coordinates of the particles are assigned to different processors and there- 
fore interprocessor movement of data becomes a considerable cost factor in the 
computation of two particle forces. In the limit of low granularity, i. e. p ~ n, 
interprocessor communication is in fact the major performance bottleneck. 
This holds particularly true for machines with next-neighbour interprocessor 
communication networks as e.g. the APEIOO/Quadrics system, manufactured 
by Alenia Spazio S.P.A. or the 16k node machine under construction at 
Columbia University Q. It is therefore crucial to gear algorithms to ren- 
der low communicational overhead on such machines for computing n^-loop 
problems. 

An earlier proposal to implement re^-loop systems on the massively paral- 
lel computer Quadrics has been presented by Paolucci|Q]. His parallel ap- 
proach rests upon the replicated data methodj^ in conjunction with global 
summations over the whole machine and arrives at a total communicational 
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complexity of 0{npi). 

In this paper we will fathom the power of systolic array computation. The 
latter is a rather general technique to organize data flow on parallel systems 
that implies the potentiality to improve the interprocessor communication 
complexity for n^-loop problems. In systolic loop processing on a p pro- 
cessor machine working on n particle dynamics with long range forces, the 
latter is of order 0{np), as each of the n data elements has to be communi- 
cated to p processors. 

Recently, we have introduced the concept of hyper-systolic array processing^]. 
This novel parallel algorithm is an extension of the standard-systolic method. 

^In Ref. 0, Parisi discusses the limiting case of 0{const.) complexity per processor on computers 
with arbitrarily large parallelism. 

^Closely related problems are protein folding, polymer dynamics with long range forces, signal 
processing, or fully coupled neural nets. 



2 



In this approach, the data flow is reorganized such that the interprocessor 
communication is reduced to order 2n{2.^J^ — 1)) for an array of n particles, 
at no additional computational cost. This is accomplished at the expense of 
only a modest amount of additional storage. 

Hyper-systolic data piping is structurally related to the so-called postage 
stamp problem in Additive Number Theory which can serve to classify all 
possible hyper-systolic flow patterns. Needless to say the hyper-systolic con- 
cept is very general and can be applied to all sorts of n^-computations, as e. 
g. matrix computations |^. 

In this work we shall benchmark the hyper-systolic concept for solving "n?- 
loop problems on the Quadrics system. For assessing performance gains, we 
choose the 1-dimensional standard-systolic method as our reference point. 
We will present an efficient mapping of 1-dimensional systolic arrays onto 
the 3-dimensional network of the Quadrics. In our setup, one hyper-systolic 
movement is performed in < 3 interprocessor communication steps. 



2 Systolic Array Implementation 

We consider for definiteness the computation of the local Coulombic forces 
within an n-body problem 

n -, 

Vi = F{x.i) = ^ J i = l,...,n, (1) 

.^^ \Xi Xj\ 

the evaluation of which represents the computational bottleneck. 

Let us assume for a start the limit of granularity to be = n/p = 1. 
Our standard-systolic loop approach to compute Eq. || proceeds by mapping 
the processors onto a 1-dimensional periodic chain (GRA0), see Fig. ||. The 
information on the location of the physical system in configuration space, 
X = {xi,X2, ■ ■ ■ ,Xn), is distributed over the compute system such that each 
processor pi holds in its local memory the coordinate Xi as its 'resident vector'. 
Besides the local resident vector Xi there is a local migrating vector x'^, that 
originally coincides with Xj. During a systolic step, x'^ is shifted cyclically 
onto the right hand next-neighbour processor on the 1-dimensional systolic 

■^This name was chosen as a mnemonic to the 20th century Roman traffic solution Grande 
Raccordo Anulare. 
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Figure 1: Systolic array GRA. 



chain GRA. After completion of this systolic move, the computation of the 
local pairings is performed in parallel and the results are accumulated into 
a resident result array y'. After n — 1 such systolic steps, Eq. |l] is fully 
computed^. 

In Fig. gi, we illustrate the mapping of the 1-dimensional systolic array onto 
the Quadrics processor topology: on a Quadrics Q4, with n = p = 32. The 
processors of the Q4 are arranged in a 8 x 2 x 2-grid that is toroidally closed]^. 



One must differentiate between a movement on the logical, systolic chain 
GRA and its actual hardware realization, in terms of interprocessor commu- 
nications: one systolic movement to the next neighbour processor on GRA 
amounts to three steps on the interprocessor links, as depicted in Fig. ^: 

Step 1: cyclical shift to next neighbour in y-direction. 

Step 2: cyclical shift to next neighbour in z-direction, at y = 1 only. 

Step 3: cyclical shift to next neighbour in x-direction, y = 1, z = 1 only. 



For the entire systolic process, 3{p—l) such interprocessor communication op- 
erations are needed. Therefore the communicational complexity of standard- 
systolic computations on the Quadrics is 3p{p — 1) . 

Proceeding to the case of n > p, we partition the systolic array into ^ subar- 
rays and distribute each subarray across the machine. From the point of view 
of interprocessor communication the pairings of coordinates from different 
subarrays present no additional complication. The total number of interpro- 
cessor communication operations is now found as 3n{p — 1), while the total 
number of computation operations remains unaltered, n{n — 1). 

^One invokes Newton's actio = reactio to reduce the amount of computation by a factor two. 
^The mapping of systolic arrays with n = p onto other Quadrics configurations, like Q16, QH2, 
and QH4 is straightforward. 
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Figure 2: Systolic movement, a) Initial state: Mapping of a 1-dimensional systolic 
array with n = 32 onto the 3-dimensional Quadrics Q4 with p = 32 processors. 
The systolic chain is periodically closed, b) Final state: reached after 3 steps. The 
coordinate system is given to specify the Quadrics' 8x2x2 processor grid. 
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3 Hyper-Systolic Algorithm 



The hyper-systolic reorganization of the computation of Eq. |T| wiU enable us 
to reduce the complexity of interprocessor communication to 0(np2); thus, 
interprocessor communication ultimately becomes negligible compared to the 
pure computational complexity, on machines with a very large number of 
processors. 

Quadrics machines allow for an efficient realization of hyper-systolic array 
computing, due to their 3-dimensional cubic connectivity^. 

3.1 Definition: Regular Base Algorithm 

The hyper-systolic data movement generalizes the standard-systolic one by 
increasing the number of resident arrays from one to k, with k <^ p and 
introducing cyclical shifts with strides at > 1. 

Let us give an operational description of the hyper-systolic algorithm for ex- 
treme granularity (7 = 1: 

1. For a given array x of length n, k arrays xt are generated by shifting 
the original array x k times by strides at, and copying x to x^ in step t, 
1 < t < k. The variable stride at of the shifts must be chosen such that 

(a) all pairings of data elements occur at least once, and 

(b) the number of equal pairings as well as 

(c) the number of shifts k as well as 

(d) the number of different strides at 
are minimized. 

2. The required n(n — l)/2 results iji in Eq. Q are successively computed 
and are added to k resulting arrays y^. 

3. Finally, applying the inverse shift sequence, the arrays yj are shifted 
back to their proper locations and corresponding entries are summed up 
to build the elements of the final result array y. 

^From the point of view of 1-dimcnsional systolic data arrangement and movement the 3- 
dimensional network of the Quadrics supports accelerated communication along transversal direc- 
tions (wormhole communication). 
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The interested reader can find the optimal solutions to this problem for 8 < 
p < 64 in Ref. . In general, one can resort to so-called regular base solutions 
which are very simple yet still in the optimal complexity class. 

In the following we implement the hyper-systolic algorithm using the regular 
base: 

Ak=2K-i=(0,l,h...,hK,K,...,K], 



(2) 

K -1 K 



with K — 1 shifts by stride oj = 1 and K shifts by stride at = K. The length 
of the base is A; = 2K + 1. The shift constant K is given by = ^J^. 

3.2 Implementation on the Q4. 

For illustration we consider the case n = 32. The regular hyper-systolic base 
is given by 

^7 = (0,1,1,1,4,4,4,4), (3) 



leading to the storage of eight arrays as depicted in Fig. ^. By inspection one 

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 
32 12345678 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 
31 32 12345678 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 
30 31 32 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 
26 27 28 29 30 31 32 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 
22 23 24 25 26 27 28 29 30 31 32 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 
18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 
14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 1 2 3 4 5 6 7 8 9 10 11 12 13 

Figure 3: Hyper-systolic data movement on the Q4 for n = 32 using the regular 
base A-j = (0,1,1,1,4,4,4,4). 

finds all pairings of the numbers 1 to 32 within the columns, i. e. within the 
32 local memories. 

The systolic chain x is mapped in the very same way as in section |2| for stride 
at = ^-. The shifts by the stride = 4 present a novel feature of the hyper- 
systolic method: from Fig. ^ it can be seen that such a hyper-shift, on the 
Q4, is performed by just one communication operation to the next-neighbour 
in x-direction. 
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a) 



b) 



X 



Figure 4: Shift by a stride of 4 on the Quadrics Q4. a) Initial state, b) final state. 



Thus the communicational complexity of the regular base hyper-systolic method 
on the Q4 is improved compared to the scaling law in Ref. 0], that has been 
derived for constant communicational cost for all strides, and is given by 

2n{4:^J^— 1). The improvement with respect to the standard-systolic method 
on the Q4 is 

-^T) ( n — 1 1 / r) 

(4) 



R= ^"(^Z^) ^3,/^ = 3. 

V 32 



2n(4^| - 1) 

This result holds also for the case n > p where we partition the array into 
subarrays and compute all relevant interactions^. 



^Recently, the so called Half-Orrery algorithm has been presented in Ref. |jT^. This method 
also belongs to the class of hyper-systolic algorithms. In principle, it can, as is the case with the 
approach presented here, make usage of the symmetry inherent in Eq. |l|. The communication, 
however, is not reduced compared to the standard-systolic way, and moreover on the Quadrics, 
additional back-shifts have to be performed. Thus Half-Orrery is not competitive on Quadrics. 
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3.3 Implementation on the QH2 and QH4 



The 256-node Quadrics QH2 is arranged as an 8 x 4 x 8 lattice. The procedure 
underlying Eq. ^ needs to be generalized as \/128 is not integer. We can utilize 
the decomposition | = KK' and select the base 

Ak=K-l+K'=(0,l, 1, . . . , 1, K, K, . . . , i^) , 

K -1 K' 
For the QH2, the favourite base reads 

A23 = (0, 1, 1, 1, 1, 1, 1, 1, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8) . (6) 

Note that a stride by 8 can be performed with 2 communication operations 
only, analogous to the shifts on the Q4. It turns out that the ratio R between 
standard-systolic and hyper-systolic data communication expense theoreti- 
cally is about 7.2. 

The QH4 with 512 processors and 8x8x8 topology readily lends itself 
to the regular hyper-systolic base, Eq. again with a stride of K = 16. 
Constrained by the topology, a shift with a stride of 16 has to be accomplished 
within 3 communication operations by 2 strides of length 8, and theoretically 
a gain in communicational expense of a factor of 8 can be achieved. It is 
interesting to note that asymmetric processor grids can increase the hyper- 
systolic performance: e. g., if the 16 x 8 x 4-topology is realized, the gain factor 
is about 10. With a 4 x 4 x 32-topology, the gain is optimized further to 12. 



4 Performance Results 



4.1 Performance Tests on the Q4 

The hyper-systolic computation of the interaction given by Eq. |l] on Qadrics 
machines has been developed and first tested on the 32-node Q4 at Wuppertal 
University, Germany. In order to speed up the purely computational part of 
the program we have employed block matrices in form of register declarations 
for the compute intensive part. As the number of registers is limited to 128 per 
processor, this approach amounts to an additional second blocking in form of 
loop-unrolling in the number of particles per processor ^. Thus, the array of 



9 



^ elements is subpartitioned into q parts. In our three-dimensional molecular 
dynamics example with Coulombic forces we are able to perform computations 
on 2 arrays of up to 6 particles, i.e., 6x6 coordinate pairings can be treated 
while no recourse to the local memory is needed. We emphasize that both the 
standard-systolic and the hyper-systolic methods use the identical optimized 
code concerning this local computational part. 

We have measured the total run times, ttotai = tcpu + ^ipc, and the pure 
interprocessor communication times tipc as a function of the total number of 
particles n. 

As a result we show the performance plot of the Q4 in Fig. ^. The total time 
expense (total) and the pure communication time (ipc) for one computation of 
all forces are plotted for both the standard-systolic (sys) and the hyper-systolic 
(hys) algorithms. As expected, for both methods, we find a linear increase in 




200 400 600 800 1000 



# of particles 



Figure 5: Overall time (total) and communication time (ipc) vs. the number of 
particles for one interaction time-step on the 32-nodes Quadrics Q4 for the systolic 
and the hyper-systolic method. 

communicational and a quadratic behaviour of the total time expense, as we 
are working on a fixed machine size, with constant number of processors p. 



10 



The insertion zooms the curves in a regime of n that is common in practical 
computations on a machine with the number and performance of processors 
as provided by the Q4 of particles < 500 for the Q4). We find here 
that both advantages of the hyper-systolic method lead to their full pay-off: 
the communication time is drastically reduced and together with the smaller 
amount of computation operations, the total time expense is diminished by 
more than a factor of two. For example, one interaction step on 192 particles 
requires 4.2 milliseconds in the systolic case compared to 1.8 milliseconds in 
the hyper-systolic mode. 

Fig. 1^ plots the 'inefficiency factor', ttotai/tcpu, for both algorithms. The 
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200 400 600 800 1000 

# of particles 

Figure 6: as function of the number of particles on the Q4. 

tcpu 

threshold behaviour of the inefficiency factor in the region n < 150 reflects 
the characteristics of the Quadrics register structure: for the registers to be 
filled without bubbles, one needs at least a set of 2 x 6 particles, while the 
communication time increases proportional to n. Notice that communication 
plays a much smaller role for the hyper-systolic method and even for a gran- 
ularity of g = 32, i.e.n = 1024, the performance in the standard-systolic 
approach is still considerably affected by communication. 

In Fig. 1^, we display the gain factors |^ separately, in terms of the total, the 
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net computational and the interprocessor communication overhead times. On 
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Figure 7: Gain factor for overall time (total), net computation (cpu) and inter- 
processor communication (ipc) times as measured on the Q4. 

the Q4, we actually achieve a gain factor of about 3.2 in communication time 
across the entire granularity range, to be compared to the predicted value of 
3. The startup behaviour of the three curves at small n again reflects the 
filling of the registers. The overall gain reaches its maximum in the range 
of 128 to 192 particles, whereupon the competing complexities are leading to 
more modest improvements. Yet, the hyper-systolic mode still results in an 
attractive gain of 60% for 1024 particles, at granularity 32! 

Let us turn next to efficiency measurements. Counting the computation of 
an inverse square root with 18 floating point operations, we assign 35 floating 
point operations to the computation of each rij in Eq. |l| on the 3-dimensional 
problem. Fig. ^ presents the flop rates as measured with the hyper-systolic 
method on the Q4 (in units of its peak performance) for the computation 
of the Coulomb interaction. Before the registers are fully exploited (< 150 
particles on the Q4), the performance is small as expected. For larger values 
of n, however, a remarkably high performance can be achieved for the full 
n-body computation of the Coulomb force in three dimensions. In terms of 
the peak performance of the Quadrics Q4, i. e. 1.6 Chop, nearly 50 % can be 
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Figure 8: Performance in % of the peak performance on the Q4. 

reached in the hmit of large n. In real life, one would consider the Q4 to be 
the adequate Quadrics configuration for treating 200 to 500 particles; in this 
range we observe efficiencies between 20 and 30 % of the peak performance. 



4.2 Performance Tests on the QH4 

Next we extend our performance measurements to the QH4 configuration 
which is equipped with 512 processing nodes and usually configured as 8 x 8 x 8 
grid. 

As explained above , we expect an improvement in interprocessor communi- 
cation by about a factor of 8 on the QH4 (as compared to 3 on the 32-node 
Q4)^. The results of our QH4 benchmark are displayed in Fig. which is 
the analogue to the Q4 plot in Fig. |5[ The insertion zooms the granularity 
range g < 8. Characteristic for the hyper-systolic algorithm is the fact that, 
even for g = 1, tipc is one order of magnitude smaller than tcpu- In the whole 

On top of that, there should be additional, computational gains with respect to the Q4. 
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# of particles 

Figure 9: Total time (total) and communication time (ipc) vs. the number of par- 
ticles for one interaction time-step on the 512-nodes Quadrics QH4 for the systolic 
and the hyper-systolic method. 



range of granularities investigated, for the hyper-systolic method, the size of 
tipc is marginal, whilst the standard-systolic computation is dominated by 
communication overhead, up to the point g = 16. 

The QH4 inefficiency factor, as shown in Fig. |l^ is very reminiscent to Fig. ^, 
featuring again the effects of register filling in form of the characteristic thresh- 
old behaviour at small n. The important message to be drawn from this plot 
is the finding that in the relevant intermediate region of say g < 16, hyper- 
systolic are by far superior to standard-systolic implementations in perfor- 
mance. Note that in the asymptotic regime of very large granularity the 
machine can be viewed more and more as an assembly of independently work- 
ing processors; in this 'scalar' limit, communication ceases to dominate for 
both systolic varieties, but our n^-loop computation approaches the creeping 
mode! 



In Fig. 11 we display the gain factor for the hyper-systolic implementation 
on the QH4 for particle numbers up to 35.000. It is not surprising after 
our theoretical discussion in section 3.S to find an improvement of > 8 in the 
interprocessor communication time throughout the observed g range. Here we 
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reach a slightly improved level for the gain in CPU-time as the computations 
contribute more heavily (larger n as compared to the Q4 case). The improved 
communication time boosts the peak in the overall gain factor from 2.8 on 
the Q4 to 3.5 on the QH4. 

Finally it is rather gratifying to find a high efficiency level for the QH4 in terms 
of its peak performance, see Fig. The efficiency of the QH4 in the hyper- 
systolic implementation of molecular dynamics saturates at a value close to 
60%, crossing the 50% mark at granularity 20. Even at low granularity like 
g ~ 6 we observe efficiencies in the ball-park 30 to 40 %! 



5 Summary 



We have investigated standard-systolic and hyper-systolic computations for a 
prototype n^-loop problem on the massively parallel computer Quadrics. 

A next-neighbour data movement on the logical systolic chain "GRA" is real- 
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Figure 11: Ratio for total time (total), pure computation (cpu) and interpro- 
cessor communication (ipc) on the QH4. 

ized in form of three interprocessor communication steps, the hyper-systolic 
data movement can be performed in one interprocessor communication step 
on the Q4, two steps on the 8x4x8 QH2 and three steps on 8 x 8 x 8 QH4, 
respectively. 

As expected, our measurement reveals a gain factor of Ri 3 in communication 
time compared to the standard-systolic computation, on the Quadrics Q4. In 
this way, we verified a performance of up to 750 Mflops in molecular dynamics 
with exact long range gravitational forces for acceptable granularity. Satura- 
tion would be reached at about 50 % of the theoretical peak performance of 
the machine. 

Even more encouraging results were subsequently found in measurements on 
the largest actually built member of the Quadrics line, the QH4, where the 
cpu-power can be exploited even for small granularity. The performance satu- 
rates for n > 32k near 60 % of the peak performance, i. e. nearly 15.5 Gflops, 
for long-range Coulomb interactions. 

According to the results, on the QH4, it appears realistic to perform molecular 
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Figure 12: Performance in % of the peak performance on the QH4. 

dynamics simulations with 0{16k) particles in 6 days compute time with a 
length of 10^ molecular dynamics time steps. 

We thus conclude that the APEIOO/Quadrics architecture should no longer be 
considered as special purpose machines for lattice QCD. Instead they should 
be viewed as powerful devices that can solve just as well many other bottleneck 
problems of Computational Science belonging to the wide class of n^-loop 
problems. 

In two consecutive papers we will evaluate the impact of the hyper-systolic 
approach on other bottleneck problems in HPCN such as matrix products 
and fast Fourier transforms. 
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